Donohue, et al.
OK! We are now ready to start analyzing our data!
We have what is called panel data, a special type of longitudinal data. Longitudinal data are measurements taken over time. Panel data are data measured repeatedly for multiple panel members, or individuals, over time. This is in contrast with time series data, which measure one individual over time, and cross-sectional data, which measure multiple individuals at one point in time. In other words, panel data are a combination of both: we measure multiple individuals over multiple time periods. In our case, we have measurements of violent crime and other variables for each state over many years; that is, we have repeated measurements of the same states over time.
In a panel analysis there are \(N\) individual panel members and \(T\) time points.
There are two types of panels:
1. Balanced - At each time point (\(T\)), there are data points for each individual (\(N\)).
| Year | State   |
|------|---------|
| 1977 | Nevada  |
| 1977 | Alabama |
| 1977 | Kansas  |
| 1978 | Nevada  |
| 1978 | Alabama |
| 1978 | Kansas  |
| 1979 | Nevada  |
| 1979 | Alabama |
| 1979 | Kansas  |
2. Unbalanced - There may be data points missing for some individuals (\(N\)) at some time points (\(T\)).
| Year | State   |
|------|---------|
| 1977 | Nevada  |
| 1977 | Alabama |
| 1978 | Nevada  |
| 1978 | Alabama |
| 1979 | Nevada  |
| 1979 | Alabama |
| 1979 | Kansas  |
Overall, in a balanced panel we have \(n\) observations, where \(n = N \times T\).
In an unbalanced panel, the number of observations is less than \(N \times T\).
In our case we have:
- \(N\) = 45 states (recall that we removed those that had adopted an RTC law before 1980)
- \(T\) = 31 years (1980-2010)
In every year we have measurements for each state (as we just saw above), thus our panel is balanced.
So, our total number of observations is \(n = 45 \times 31 = 1395\).
We will be performing a panel linear regression model analysis.
In such an analysis we will model our data according to this generic model:
\[Y_{it}=\beta_{0}+\beta_{1}X_{1it}+...+\beta_{K}X_{Kit}+e_{it}\]
Where \(i\) is the individual dimension (in our case individual states) and \(t\) is the time dimension.
Some explanatory/independent variables, or regressors, \(X_{it}\) will vary across both individuals and time, others will be fixed across time (they do not change over the course of the study), and still others will be fixed across individuals but vary across time periods.
There are three general sub-types of panel regression analysis.
Overall, these models assume that the different individuals are independent of one another, but that observations for the same individual may be correlated across time.
The main difference between the three sub-types is the assumptions made about unobserved differences between individuals.
- independently pooled panels - assumes that there are no individual effects independent of time and no effect of time common to all individuals. In other words, the independent variables are not correlated with the error term. This is essentially an ordinary least squares linear regression. This type of panel regression makes the most assumptions and is therefore typically not used for panel data.
\[Y_{it}=\beta_{0}+\beta_{1}X_{1it}+...+\beta_{K}X_{Kit}+e_{it}\] Where the intercept \(\beta_{0it}=\beta_{0}\) for all \(i,t\) and the slopes \(\beta_{kit}=\beta_{k}\) for all \(i,t\).
- fixed effects - assumes that there are unknown or unobserved unique aspects of the individuals, or heterogeneity among individuals, \(\beta_{0i}\), that are not explained by the independent variables but influence the outcome variable of interest. These effects do not vary with time (in other words, they are fixed over time) but may be correlated with the independent variables \(X_{it}\).
In this case the intercept can be different for each individual \(\beta_{0i}\), but the slope is the same across all the individuals.
These individual effects \(\beta_{0i}\) can be correlated with the independent variables \(X\).
\[Y_{it}=\beta_{0i}+\beta_{1}X_{1it}+...+\beta_{K}X_{Kit}+e_{it}\] This type of panel regression makes the fewest assumptions.
- random effects - assumes that there are unknown or unobserved unique qualities about the individuals that influences the outcome variable of interest that are not correlated with the independent variables.
Thus, the random effects model actually makes more assumptions than the fixed effect model.
\[Y_{it}=\beta_{0}+\beta_{1}X_{1it}+...+\beta_{K}X_{Kit}+u_{i}+e_{it}\] So each individual has the same intercept and slopes, and the individual effect \(u_{i}\) is absorbed into a composite error term \((u_{i}+e_{it})\).
See the plm package documentation and vignettes for more information about these different models.
We will be performing a fixed effects panel regression analysis, as we think that some of the unobserved qualities of the different states may be correlated with some of our independent variables. For example, the level of economic opportunity might be an unobserved feature of the states that influences violent crime rate and could be correlated with poverty rate and unemployment.
To perform our analysis we will be using the plm package. This stands for Panel Linear Model.
We need to use a special type of data to use this package, called a pdata.frame which is short for panel data frame. This allows us to specify that we are using panel data and what the panel structure looks like.
We need to indicate which variable should be used to identify the individuals in our panel, and which variable should be used to identify the time periods. In our case the STATE variable identifies the individuals and the YEAR variable identifies the time periods.
We can specify this structure using the pdata.frame() function of the plm package, by using the index argument, where the individual variable is specified first, followed by the time variable, like so: `index = c("Individual_Variable_NAME", "Time_Period_Variable_NAME")`.
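As a minimal runnable sketch, here is the same call applied to a toy data frame (the toy data and its values are ours, not the case-study data; the case study applies the identical call to its cleaned data frame):

```r
library(plm)

# Toy example: two states observed over three years.
# The case study applies the same pdata.frame() call to its cleaned data.
set.seed(1)
toy <- data.frame(
  STATE = rep(c("Alaska", "Alabama"), each = 3),
  YEAR  = rep(1980:1982, times = 2),
  Viol_crime_rate_1k_log = rnorm(6)
)

# Individual variable first, then the time variable
toy_panel <- pdata.frame(toy, index = c("STATE", "YEAR"))

class(toy_panel)        # "pdata.frame" "data.frame"
is.pbalanced(toy_panel) # TRUE: both states appear in all three years
pdim(toy_panel)         # panel dimensions (note: plm labels individuals n
                        # and total observations N, the reverse of our text)
```

pdim() and is.pbalanced() offer a quick check that the panel structure was picked up as intended.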
[1] "pdata.frame" "data.frame"
YEAR STATE Black_Male_15_to_19_years Black_Male_20_to_39_years
Alaska-1980 1980 Alaska 0.1670456 0.9933775
Alaska-1981 1981 Alaska 0.1732299 1.0028219
Alaska-1982 1982 Alaska 0.1737069 1.0204445
Other_Male_15_to_19_years Other_Male_20_to_39_years
Alaska-1980 1.129782 2.963329
Alaska-1981 1.124441 2.974775
Alaska-1982 1.069821 3.015071
White_Male_15_to_19_years White_Male_20_to_39_years
Alaska-1980 3.627805 18.28852
Alaska-1981 3.558261 18.12821
Alaska-1982 3.391844 18.10666
Unemployment_rate Poverty_rate Viol_crime_count Population
Alaska-1980 9.6 9.6 1919 404680
Alaska-1981 9.4 9.0 2537 418519
Alaska-1982 9.9 10.6 2732 449608
police_per_100k_lag RTC_LAW_YEAR RTC_LAW TIME_0 TIME_INF
Alaska-1980 194.7218 1995 FALSE 1980 2010
Alaska-1981 200.2299 1995 FALSE 1980 2010
Alaska-1982 191.0553 1995 FALSE 1980 2010
Viol_crime_rate_1k Viol_crime_rate_1k_log Population_log
Alaska-1980 4.742018 1.556463 12.91085
Alaska-1981 6.061851 1.802015 12.94448
Alaska-1982 6.076404 1.804413 13.01613
Indeed we have now created a pdata.frame object and we can see that the row names show the individual states and time period years.
OK, now we are ready to run our panel linear model on our panel data frame.
To do so we will use the plm() function and we will specify the formula for our model, where the dependent variable Viol_crime_rate_1k_log will be on the left of our ~ sign and all of the independent variables will be listed on the right with + signs in between each.
We also need to specify what type of effect we would like to model and what type of model we would like to use.
There are three main options for the effect argument:
1) individual - model the effect of individual identity
2) time - model the effect of time
3) twoways - model the effects of both individual identity and time
There are four main options for the model argument:
1) pooling - standard pooled ordinary least squares regression model
2) within - fixed effects model (variation between individuals is ignored, model compares individuals to themselves at different periods of time)
3) between - between model (variation within individuals from one time point to another is ignored, model compares individuals' values averaged over time)
4) random - random effects model (each state has a different intercept, but the intercepts are assumed to follow a normal distribution - requires more assumptions)
Typically it is best to think about what you are trying to evaluate with your data in trying to choose how to model your data. However, there are also some tests that can help to assess this which we will briefly cover.
We are interested in how violence in each state varied over time, thus we are interested in within STATE variation, so we will perform our plm with the model = "within" argument to perform this particular type of fixed effects model.
We also speculate that there is an effect of individual STATE identity and time on violent crime rate. In other words, we expect some states to have high rates of crime, and others to have low rates of crime. We also expect crime to change over time.
If we were to perform this type of analysis we would use the effect = "twoways" argument in our plm() function like so:
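The exact call can be reconstructed from the `Call:` section of the model summary printed below; here is a runnable sketch of its shape, fit to a small synthetic stand-in data set with the same column names (the real analysis uses the cleaned d_panel_DONOHUE pdata.frame):

```r
library(plm)

# Synthetic stand-in data with the same column names as the case study;
# the numbers are random and only illustrate the shape of the call.
set.seed(1)
syn <- expand.grid(STATE = paste0("State", 1:5), YEAR = 1980:1989)
for (v in c("White_Male_15_to_19_years", "White_Male_20_to_39_years",
            "Black_Male_15_to_19_years", "Black_Male_20_to_39_years",
            "Other_Male_15_to_19_years", "Other_Male_20_to_39_years",
            "Unemployment_rate", "Poverty_rate", "Population_log",
            "police_per_100k_lag", "Viol_crime_rate_1k_log")) {
  syn[[v]] <- rnorm(nrow(syn))
}
syn$RTC_LAW <- syn$YEAR >= 1985 & syn$STATE %in% c("State1", "State2")
d_panel_syn <- pdata.frame(syn, index = c("STATE", "YEAR"))

# Two-way ("twoways") fixed effects ("within") model
fit_twoways <- plm(
  Viol_crime_rate_1k_log ~ RTC_LAW + White_Male_15_to_19_years +
    White_Male_20_to_39_years + Black_Male_15_to_19_years +
    Black_Male_20_to_39_years + Other_Male_15_to_19_years +
    Other_Male_20_to_39_years + Unemployment_rate + Poverty_rate +
    Population_log + police_per_100k_lag,
  data = d_panel_syn, effect = "twoways", model = "within"
)
```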
To see the results we can use the base summary() function. We can view this output in tidy format using the tidy() function of the broom package.
We will also add an analysis variable as a label for plots.
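As a sketch of this summarizing step on a stand-in model (using Produc, a state-year panel data set shipped with plm, since the case-study model object is not re-created here; the "Donohue" label value is a placeholder):

```r
library(plm)
library(broom)

data("Produc", package = "plm")  # state-year panel shipped with plm

fit <- plm(log(gsp) ~ pcap + emp + unemp, data = Produc,
           effect = "twoways", model = "within")

summary(fit)  # base summary

td <- tidy(fit, conf.int = TRUE)  # tidy coefficient table with CIs
td$analysis <- "Donohue"          # label column for later plots (placeholder value)
td
```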
Twoways effects Within Model
Call:
plm(formula = Viol_crime_rate_1k_log ~ RTC_LAW + White_Male_15_to_19_years +
White_Male_20_to_39_years + Black_Male_15_to_19_years + Black_Male_20_to_39_years +
Other_Male_15_to_19_years + Other_Male_20_to_39_years + Unemployment_rate +
Poverty_rate + Population_log + police_per_100k_lag, data = d_panel_DONOHUE,
effect = "twoways", model = "within")
Balanced Panel: n = 45, T = 31, N = 1395
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-0.57957437 -0.08942194 -0.00090654 0.08673054 1.11216999
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
RTC_LAWTRUE 0.01796779 0.01663911 1.0799 0.2804066
White_Male_15_to_19_years -0.00091825 0.02724210 -0.0337 0.9731160
White_Male_20_to_39_years 0.03466473 0.00972839 3.5633 0.0003794 ***
Black_Male_15_to_19_years -0.05673593 0.05746052 -0.9874 0.3236341
Black_Male_20_to_39_years 0.12605439 0.01931450 6.5264 9.607e-11 ***
Other_Male_15_to_19_years 0.69201638 0.11322394 6.1119 1.297e-09 ***
Other_Male_20_to_39_years -0.30276797 0.03811855 -7.9428 4.226e-15 ***
Unemployment_rate -0.01685806 0.00489952 -3.4408 0.0005984 ***
Poverty_rate -0.00780235 0.00295720 -2.6384 0.0084280 **
Population_log -0.17991653 0.06041773 -2.9779 0.0029559 **
police_per_100k_lag 0.00060391 0.00013689 4.4115 1.111e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 43.211
Residual Sum of Squares: 36.716
R-Squared: 0.15031
Adj. R-Squared: 0.095138
F-statistic: 21.0514 on 11 and 1309 DF, p-value: < 2.22e-16
# A tibble: 11 x 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 RTC_LAWTRUE 0.0180 0.0166 1.08 2.80e- 1 -1.46e-2 0.0506
2 White_Male_15_to_1… -0.000918 0.0272 -0.0337 9.73e- 1 -5.43e-2 0.0525
3 White_Male_20_to_3… 0.0347 0.00973 3.56 3.79e- 4 1.56e-2 0.0537
4 Black_Male_15_to_1… -0.0567 0.0575 -0.987 3.24e- 1 -1.69e-1 0.0559
5 Black_Male_20_to_3… 0.126 0.0193 6.53 9.61e-11 8.82e-2 0.164
6 Other_Male_15_to_1… 0.692 0.113 6.11 1.30e- 9 4.70e-1 0.914
7 Other_Male_20_to_3… -0.303 0.0381 -7.94 4.23e-15 -3.77e-1 -0.228
8 Unemployment_rate -0.0169 0.00490 -3.44 5.98e- 4 -2.65e-2 -0.00726
9 Poverty_rate -0.00780 0.00296 -2.64 8.43e- 3 -1.36e-2 -0.00201
10 Population_log -0.180 0.0604 -2.98 2.96e- 3 -2.98e-1 -0.0615
11 police_per_100k_lag 0.000604 0.000137 4.41 1.11e- 5 3.36e-4 0.000872
We will now perform a test to determine if we could have simply used a pooled model. This test evaluates if the coefficients (including the intercepts) are equal. To perform this test we will use the pooltest() function of the plm package to compare the pooled model to the fixed effect within model.
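A runnable sketch of this test, again on the Produc stand-in panel (per the plm documentation, pooltest() compares a pooled plm fit against a pvcm() fit, which estimates a separate regression for each individual):

```r
library(plm)

data("Produc", package = "plm")  # state-year panel shipped with plm

f <- log(gsp) ~ pcap + emp + unemp

pooled    <- plm(f, data = Produc, model = "pooling")
# pvcm() fits one regression per state; pooltest() then tests whether
# all coefficients (including intercepts) are equal across states
per_state <- pvcm(f, data = Produc, model = "within")

pooltest(pooled, per_state)
```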
F statistic
data: Viol_crime_rate_1k_log ~ RTC_LAW + White_Male_15_to_19_years + ...
F = 15.299, df1 = 484, df2 = 855, p-value < 2.2e-16
alternative hypothesis: unstability
The p-value is less than a significance threshold of .05, thus we reject the null hypothesis that the coefficients are equal across individuals. The within fixed effects model fits the data better than the pooled model.
We can also perform a test to evaluate if there is indeed an individual effect and a time effect in our model.
We can use the plmtest() function of the plm package. This performs a Lagrange Multiplier Test. To do so, we need to get the output for a simple pooled model.
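Sketched on the Produc stand-in panel, the test takes the pooled fit plus the effect type (type = "honda" requests the Honda statistic, matching the output shown below):

```r
library(plm)

data("Produc", package = "plm")  # state-year panel shipped with plm

# The Lagrange Multiplier test is applied to a simple pooled model
pooled <- plm(log(gsp) ~ pcap + emp + unemp, data = Produc, model = "pooling")

plmtest(pooled, effect = "twoways", type = "honda")
```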
Lagrange Multiplier Test - two-ways effects (Honda) for balanced
panels
data: Viol_crime_rate_1k_log ~ RTC_LAW + White_Male_15_to_19_years + ...
normal = 72.56, p-value < 2.2e-16
alternative hypothesis: significant effects
Again, the p-value is much smaller than the significance threshold of 0.05. Therefore we reject the null hypothesis that there are no individual or time effects, and we can feel confident proceeding with a twoways effect model.
There is one more test that we could perform. To test if a random effects model would be more appropriate than the fixed effects model, one can use the Hausman test (also called the Durbin-Wu-Hausman test). This can be implemented using the phtest() function of the plm package.
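A sketch on the Produc stand-in panel; phtest() takes the fixed (within) and random effects fits of the same formula:

```r
library(plm)

data("Produc", package = "plm")  # state-year panel shipped with plm

f <- log(gsp) ~ pcap + emp + unemp

within_fit <- plm(f, data = Produc, model = "within")
random_fit <- plm(f, data = Produc, model = "random")

# Hausman test: rejecting the null favors the fixed effects model
phtest(within_fit, random_fit)
```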
Hausman Test
data: Viol_crime_rate_1k_log ~ RTC_LAW + White_Male_15_to_19_years + ...
chisq = 59.54, df = 11, p-value = 1.129e-08
alternative hypothesis: one model is inconsistent
This test evaluates if there are errors \(u_i\) that are correlated with any of the independent variables.
We reject the null hypothesis, under which both models would be consistent and the more efficient random effects model preferred, and so we are confirmed in our plan to use a fixed effects model.
For more information on these tests and this package, see the plm package documentation and vignettes.
A final note about the fixed and random effects terminology and how this is slightly different than other definitions:
According to the documentation for the plm package:

> The fixed/random effects terminology in econometrics is often recognized to be misleading, as both are treated as random variates in modern econometrics (see, e.g., Wooldridge (2002) 10.2.1). It has been recognized since Mundlak’s classic paper (Mundlak (1978)) that the fundamental issue is whether the unobserved effects are correlated with the regressors or not. In this last case, they can safely be left in the error term, and the serial correlation they induce is cared for by means of appropriate GLS transformations. On the contrary, in the case of correlation, “fixed effects” methods such as least squares dummy variables or time-demeaning are needed, which explicitly, although inconsistently, estimate a group– (or time–) invariant additional parameter for each group (or time period).
Thus, from the point of view of model specification, having fixed effects in an econometric model has the meaning of allowing the intercept to vary with group, or time, or both, while the other parameters are generally still assumed to be homogeneous. Having random effects means having a group– (or time–, or both) specific component in the error term.
In the mixed models literature, on the contrary, fixed effect indicates a parameter that is assumed constant, while random effects are parameters that vary randomly around zero according to a joint multivariate normal distribution.